Predicting missing links in incomplete networks, such as protein-protein interaction networks, or very likely but not yet existent links in evolving networks, such as friendship networks in web society, can guide further experiments or provide valuable information for web users. In this paper, we introduce a local path index to estimate the likelihood of the existence of a link between two nodes. We propose a network model with controllable density and noise strength in generating links, and collect data on six real networks. Extensive numerical simulations on both modeled and real networks demonstrate the high effectiveness and efficiency of the local path index compared with two well-known and widely used indices, the common neighbors and the Katz index. Indeed, the local path index provides predictions competitively accurate with those of the Katz index while requiring much less CPU time and memory space, and is therefore a strong candidate for practical applications in data mining of huge-size networks.
I. INTRODUCTION

Many complex systems can be well described by networks, where nodes represent individuals or agents and links denote the relations or interactions between nodes. The complex network is therefore becoming a useful tool for analyzing a wide range of complex systems. Recently, the understanding of the structure, evolution and function of networks has attracted much attention from the physics community. Another important scientific issue relevant to network analysis, namely Information Retrieval, has however received less attention. Originally, Information Retrieval aims at finding material of an unstructured nature that satisfies an information need from large collections. It can also be viewed as dealing with the prediction of links between words and documents, and has now been extended to stand for a number of problems in link mining. Actually, link prediction is a long-standing challenge in modern information science, and many algorithms based on Markov chains and machine learning processes have been proposed by the computer science community. However, these works have not caught up with the current progress in the study of complex networks; in particular, they lack serious consideration of the structural characteristics of networks, which may indeed provide useful information and insights for link prediction.

The problem of link prediction aims at estimating the likelihood of the existence of a link between two nodes, based on the observed links and the attributes of nodes. It can be categorized into two classes: one is the prediction of missing links in sampled networks, such as food webs and the World Wide Web; the other is the prediction of links that may appear in the future of evolving networks, such as online social networks. In addition, link prediction algorithms (or other algorithms based on similar techniques) can also be applied to solve the link classification problem in partially labeled networks, such as predicting protein functions and distinguishing the research areas of scientific publications.

Up to now, most algorithms are designed according to a definition of node similarity. Node similarity can be defined using only the essential attributes of nodes: two nodes are considered similar if they have many common features [13]. Another group of similarity indices is based solely on the network structure; this is called structural similarity and can be further classified into node-dependent, path-dependent and mixed methods. An introduction to and comparison of some similarity indices has been presented in the literature, in which the Common Neighbors, Jaccard coefficient [20], Adamic-Adar Index [21] and Preferential Attachment [22] are classified as node-dependent indices, while the Katz Index, Hitting Time, Commute Time [25], Rooted PageRank, SimRank [23] and Blondel Index [28] are classified as path-dependent indices. In addition, Leicht, Holme and Newman proposed a measure that quantifies node similarity based on the assumption that two nodes are similar if their immediate neighbors in the network are themselves similar [29]. This leads to a self-consistent matrix formulation of similarity that can be evaluated iteratively using the adjacency matrix. This similarity index can also be considered as a candidate for accurate link prediction.

Besides the similarity-based prediction algorithms, some more complicated methods have been proposed recently. Clauset, Moore and Newman proposed an algorithm based on the hierarchical network structure [30, 31]. First, they use a hierarchical random graph to statistically fit the real network data. Then the dependence of the lateral-connection probability on the depth of the nodes in the hierarchy can be inferred. Finally, one can predict the missing links of the network by ranking them in descending order of the lateral-connection probability. Furthermore, much effort has been devoted to designing recommender systems [32]. Actually, the process of recommending items to a user can be considered as the prediction of missing links in the user-item bipartite network. In particular, physicists have recently proposed some information recommendation algorithms based on physical processes, such as energy diffusion [35] and heat conduction. Although this issue has not been fully explored, it highlights the possibility of improving the accuracy and efficiency of link prediction algorithms by applying classical physical dynamics.

The study of link prediction faces many difficulties. One is the sparsity of the target networks, which leads to the serious problem that the prior probability of a link is typically quite small, making it very difficult to build statistical models. The other is the huge size of real systems, which requires highly efficient algorithms. However, the computational complexity in time and memory, a crucial factor in real applications, has not been systematically investigated. Generally speaking, the accuracy of an algorithm and its computational complexity are positively correlated: higher accuracy usually implies higher complexity. Note that any highly accurate algorithm becomes meaningless if the time or memory it consumes is unacceptable. Therefore, designing an accurate and fast algorithm is a big challenge, especially for sparse and huge networks.

In this paper, we introduce a so-called local path index to characterize node similarity. Extensive numerical simulations on both modeled and real networks demonstrate that this similarity index is simultaneously highly effective (its prediction accuracy is much higher than that of the common neighbors, and competitive with that of the Katz index) and highly efficient (the time and space required to compute it are much less than those for the Katz index). In particular, when the network is huge, the local path index shows a great advantage over the Katz index, since computing the latter requires a CPU time scaling as the cube of the network size, while computing the former requires a CPU time that is linear in the network size. We therefore think the local path index is a strong candidate for practical applications in data mining of huge-size complex networks.



II. METHOD 

Consider an unweighted, undirected simple network $G(V, E)$, where $V$ is the set of nodes and $E$ is the set of links. Multiple links and self-connections are not allowed. For each pair of nodes, $x, y \in V$, we assign a score, $s_{xy}$. Since $G$ is undirected, the score is supposed to be symmetric, that is, $s_{xy} = s_{yx}$. All the nonexistent links are sorted in decreasing order of their scores, and the links at the top are the most likely to exist. In this paper, we adopt the simplest framework, that is, to directly set the similarity as the score, so that a higher score means a higher similarity, and vice versa. In some link prediction algorithms, the scores are not directly related to a certain similarity measure but describe the existence likelihood of links [10], and in some other algorithms, scores are generated by integrating the similarities of node pairs in the neighborhood of the target links, as in the collaborative filtering method [41].

FIG. 1: (Color online) Prediction accuracy vs. the strength of randomness for three similarity indices: CN (circles), LP (triangles) and Katz index (squares). The network size, $N = 1000$, and the degree, $k = 10$, are fixed. Each data point is obtained by averaging over 10 independent realizations. When approaching the purely random case, $p = 1$, the accuracies of CN and LP go below 0.5, which is an artifact of the specific constraint of identical degree. That is, in the purely random case, two unconnected nodes with higher degrees in the training set are less likely to be connected in the probe set, since the total degree is identical for every node; however, they generally have more common neighbors and thus higher similarity.

In this paper, we compare the prediction accuracies 
and computational complexity of three similarity indices: 
Common Neighbors (CN), Katz Index and a newly pro- 
posed similarity index, namely Local Path Index (LP in- 
dex or LP for short). Their definitions and relevant mo- 
tivations are introduced as follows: 

(i) Common Neighbors, which is also called structural equivalence in Ref. [19]. Intuitively, two nodes, $x$ and $y$, are more likely to form a link in the future if they have many common neighbors. For a node $x$, let $\Gamma(x)$ denote the set of neighbors of $x$. The simplest measure of the neighborhood overlap is the direct count:

$$ s_{xy} = |\Gamma(x) \cap \Gamma(y)|, \qquad (1) $$

where $|Q|$ is the cardinality of the set $Q$. It is obvious that $s_{xy} = (A^2)_{xy}$, where $A$ is the adjacency matrix, in which $A_{xy} = 1$ if $x$ and $y$ are directly connected and $A_{xy} = 0$ otherwise. Note that $(A^2)_{xy}$ is also the number of different paths of length 2 connecting $x$ and $y$. Newman [42] used this quantity in the study of collaboration networks, showing the correlation between the number of common neighbors and the probability that two scientists will collaborate in the future. Some more complicated measures, such as the Salton Index, Jaccard Index [20], Sørensen Index and Adamic-Adar Index [21], can also be categorized as CN-based measures. However, extensive empirical analysis has recently demonstrated that the simplest CN (i.e., Eq. (1)) performs even better than those complicated variants [11, 13]. Therefore, we here select CN as the representative of all CN-based measures. Although CN consumes little time and performs relatively well among many local indices, its accuracy cannot catch up with that of measures based on global information, because it uses insufficient information. One typical example of the latter is the Katz Index.

FIG. 2: (Color online) Prediction accuracy vs. network density for three similarity indices: CN (circles), LP (triangles) and Katz index (squares). Since every node in this model has the same degree, we directly use the degree to denote the network density. The network size, $N = 1000$, and the strength of randomness, $p = 0.2$, are fixed. Each data point is obtained by averaging over 10 independent realizations.

FIG. 3: (Color online) An illustration of the time complexity of calculating the CN and LP indices. (a) A fully connected network with four nodes as the example. (b) Lists of the neighborhood of each node. (c) The process of determining all the similarities relevant to node 1.

(ii) Katz Index. This measure is based on the ensemble of all paths: it directly sums over the collection of paths, exponentially damped by length to give the short paths more weight. The mathematical expression reads

$$ s_{xy} = \sum_{l=1}^{\infty} \beta^{l} \cdot |\mathrm{paths}_{xy}^{\langle l \rangle}|, \qquad (2) $$

where $\mathrm{paths}_{xy}^{\langle l \rangle}$ is the set of all paths of length $l$ connecting $x$ and $y$, and $\beta$ is a free parameter controlling the weights of the paths. Obviously, a very small $\beta$ yields a measure close to CN, because the long paths contribute very little. The $S$ matrix can be written as $(I - \beta A)^{-1} - I$. Note that $\beta$ must be lower than the reciprocal of the largest eigenvalue of the matrix $A$ to ensure the convergence of Eq. (2).
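The closed form of the Katz index can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the implementation used for the reported results: the function name `katz_index` is hypothetical, the adjacency matrix is assumed to be a dense symmetric 0/1 array, and the convergence condition on $\beta$ is checked explicitly before inverting.

```python
import numpy as np

def katz_index(A, beta):
    """Katz similarity matrix S = (I - beta*A)^(-1) - I.

    Assumes A is a dense, symmetric 0/1 adjacency matrix.
    """
    A = np.asarray(A, dtype=float)
    # For a symmetric matrix, eigvalsh returns the real eigenvalues.
    lam_max = np.max(np.abs(np.linalg.eigvalsh(A)))
    if lam_max > 0 and beta >= 1.0 / lam_max:
        raise ValueError("beta must be smaller than 1/lambda_max")
    I = np.eye(A.shape[0])
    # The dense inversion is the O(N^3) step that dominates the cost.
    return np.linalg.inv(I - beta * A) - I

# Toy example (hypothetical data): a path graph 0-1-2.
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]])
S = katz_index(A, beta=0.1)
```

The dense inverse makes both the cubic time cost and the quadratic memory cost of this index visible in the code.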

(iii) Local Path Index. To provide a good tradeoff between accuracy and complexity, we here introduce an index that takes local paths into consideration, with a wider horizon than CN. It is defined as

$$ S = A^2 + \epsilon A^3, \qquad (3) $$

where $S$ denotes the similarity matrix and $\epsilon$ is a free parameter. Clearly, this measure degenerates to CN when $\epsilon = 0$. If $x$ and $y$ are not directly connected (this is the case we are interested in), $(A^3)_{xy}$ is equal to the number of different paths of length 3 connecting $x$ and $y$. Although it needs more information than CN, it is still a local measure of relatively lower complexity than global ones.
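As a minimal sketch of Eq. (3), the LP matrix can be obtained from two matrix products. The function name `local_path_index` and the toy network are assumptions made for illustration; dense matrix products are used only for clarity and do not reflect the neighbor-list procedure whose complexity is analyzed later.

```python
import numpy as np

def local_path_index(A, eps=0.01):
    """LP similarity matrix S = A^2 + eps * A^3 (Eq. (3)).

    With eps = 0 this reduces to the common-neighbors count A^2.
    """
    A = np.asarray(A, dtype=float)
    A2 = A @ A
    return A2 + eps * (A2 @ A)

# Toy example (hypothetical data): a path graph 0-1-2-3.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]])
S = local_path_index(A, eps=0.01)
# Nodes 0 and 3 share no common neighbor, so CN cannot separate the pair
# (0, 3) from any other zero-score pair; the length-3 path 0-1-2-3 gives
# it a small positive LP score.
```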

These three indices are chosen for comparison because they can all be classified as path-dependent similarities with the unified form $s_{xy} = \sum_{l} \beta^{l} \cdot |\mathrm{paths}_{xy}^{\langle l \rangle}|$, where for CN, $l = 2$; for LP, $l = 2, 3$; and for Katz, $l = 1, 2, 3, \cdots, \infty$. Since we are interested only in the indirectly connected node pairs, the Katz Index can be treated as a measure considering $l = 2, 3, \cdots, \infty$. Note that all three indices are used to quantify structural equivalence, with a latent assumption that a link itself indicates a similarity between its two endpoints (see, for example, the Leicht-Holme-Newman index [29] and transferring similarity [45]). An issue worth future exploration is whether a similarity measure based on regular equivalence (see Ref. [46] for the mathematical definition of regular equivalence and Ref. [15] for a recent application to the prediction of protein functions) can provide better predictions.
To test the algorithmic accuracy, the set of observed links, $E$, is randomly divided into two parts: the training set, $E^T$, is treated as known information, while the probe set, $E^P$, is used for testing, and no information in the probe set is allowed to be used for prediction. Clearly, $E = E^T \cup E^P$ and $E^T \cap E^P = \emptyset$. In this paper, the training set always contains 90% of the links, and the remaining 10% of the links constitute the probe set. We use a standard metric, the area under the receiver operating characteristic (ROC) curve [13], to quantify the accuracy of prediction algorithms. In the present case, this metric can be interpreted as the probability that a randomly chosen missing link (a link in $E^P$) is given a higher score than a randomly chosen nonexistent link (a link in $U \setminus E$, where $U$ denotes the universal set). In the implementation, if among $n$ independent comparisons there are $n'$ times the missing link has a higher score and $n''$ times the missing link and the nonexistent link have the same score, we define the accuracy as $(n' + 0.5 n'')/n$. If all the scores are generated from an independent and identical distribution, the accuracy should be about 0.5. Therefore, the degree to which the accuracy exceeds 0.5 indicates how much better the algorithm performs than pure chance. Readers are encouraged to see Ref. [49] for more information about how to evaluate the accuracy of prediction algorithms.
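The sampling estimate of the accuracy described above can be sketched as follows. The function name `auc_by_sampling`, the callable `score`, and the default number of comparisons are assumptions made for this illustration; the text does not specify how many comparisons were actually used.

```python
import random

def auc_by_sampling(score, probe_links, nonexistent_links, n=10000, seed=0):
    """Estimate the AUC as (n' + 0.5 * n'') / n over n random comparisons.

    score(x, y) returns the similarity of a node pair; probe_links and
    nonexistent_links are lists of (x, y) tuples.
    """
    rng = random.Random(seed)
    n_higher = 0   # n': the missing link scores higher
    n_equal = 0    # n'': both links have the same score
    for _ in range(n):
        s_missing = score(*rng.choice(probe_links))
        s_nonexistent = score(*rng.choice(nonexistent_links))
        if s_missing > s_nonexistent:
            n_higher += 1
        elif s_missing == s_nonexistent:
            n_equal += 1
    return (n_higher + 0.5 * n_equal) / n
```

A score function that assigns the same value to every pair yields exactly 0.5 under this definition, which matches the pure-chance baseline discussed in the text.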
FIG. 4: (Color online) A log-log plot of how the computational time (in milliseconds) depends on the network size for three indices, CN (circles), LP (triangles) and Katz (squares). The node degree, $k = 10$, and the strength of randomness, $p = 0.2$, are fixed. Each data point is obtained by averaging over 10 independent realizations. All computations were carried out on a desktop computer with a single Intel(R) Xeon(TM) processor (3.00 GHz) and 2 GB EMS memory.



III. MODEL 

In this section, we compare the three similarity indices in modeled networks with controllable density and randomness. Although real networks have complex structural properties [5], such as community structure, mixing patterns and the rich-club phenomenon, as a starting point we consider only a very simple model, and to eliminate the effect of degree heterogeneity, we assume that every node has an identical degree, $k$. In this model, each node is characterized by a 10-dimensional vector, each element of which is a real number randomly selected from the interval $(-1, 1)$. This vector represents the node's intrinsic features, such as the attributes of an object or the profile of a person. Two nodes are considered similar, and thus likely to connect to each other, if they share many close attributes. Therefore, we define the intrinsic similarity between two nodes as the scalar product of the corresponding vectors, namely

$$ s^{I}_{xy} = f_x \cdot f_y = s^{I}_{yx}, \qquad (4) $$

where $f_x$ is the vector of node $x$, and the superscript $I$ emphasizes that this similarity is intrinsic and cannot be observed in real systems.

Given the network size, $N$, and the degree of each node, $k$, this model starts with $N$ nodes and no links, that is, each node is of degree zero. At each time step, a node with the smallest degree is randomly selected (generally, more than one node has the smallest degree). Among all other nodes whose degrees are smaller than $k$, this selected node connects to the most similar node with probability $1 - p$, and to a randomly chosen one with probability $p$. This process terminates when all nodes are of degree $k$. The parameter $p \in [0, 1]$ represents the strength of randomness in generating links, which can be understood as the noise or irrationality that exists in almost every real system.
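The growth rule above can be sketched as follows. This is a hedged illustration rather than the authors' code: the function name `generate_model_network` is hypothetical, nodes already linked to the selected node are excluded from the candidates (an assumption implied by the simple-network constraint), and the loop simply stops if the selected node has no admissible partner, a corner case the text does not discuss.

```python
import numpy as np

def generate_model_network(N, k, p, dim=10, seed=0):
    """Grow a network in which every node aims at degree k.

    Returns the adjacency matrix A and the intrinsic feature vectors f.
    """
    rng = np.random.default_rng(seed)
    f = rng.uniform(-1.0, 1.0, size=(N, dim))  # intrinsic features
    sim = f @ f.T                              # intrinsic similarity, Eq. (4)
    A = np.zeros((N, N), dtype=int)
    deg = np.zeros(N, dtype=int)
    while True:
        unsaturated = np.flatnonzero(deg < k)
        if unsaturated.size == 0:
            break                              # all nodes have degree k
        # Randomly select one of the nodes with the smallest degree.
        smallest = unsaturated[deg[unsaturated] == deg[unsaturated].min()]
        x = rng.choice(smallest)
        # Candidates: other unsaturated nodes not yet linked to x.
        mask = (unsaturated != x) & (A[x, unsaturated] == 0)
        cand = unsaturated[mask]
        if cand.size == 0:
            break                              # assumption: stop if stuck
        if rng.random() < p:
            y = rng.choice(cand)               # random link (noise)
        else:
            y = cand[np.argmax(sim[x, cand])]  # most similar candidate
        A[x, y] = A[y, x] = 1
        deg[x] += 1
        deg[y] += 1
    return A, f
```

Because of the early stop, a few nodes may end with degree below $k$ in this sketch; a faithful reproduction would need the authors' rule for that situation.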

In Fig. 1 and Fig. 2, we report the comparison of algorithmic accuracy for the three similarity indices. Data points correspond to the optimal values of $\beta$ (for the Katz index) or $\epsilon$ (for the LP index), i.e., those giving the highest accuracies. Clearly, both the Katz index and the LP index perform remarkably better than the simple CN index. As shown in Fig. 1, when the strength of randomness/noise is weak, the LP index gives results competitive with those of the Katz index, while for highly noisy cases, the LP index performs even better. Whatever the similarity index, a link prediction algorithm is expected to give higher accuracy for a denser network, which is in accordance with what is observed in Fig. 2. In the regions with little information (i.e., small $k$) or rich information (i.e., large $k$), the LP index performs slightly better than the CN index, while in the middle region, with degrees typical of real networks, the LP index can perform much better than the CN index.

The reason why the CN index performs remarkably worse than the LP index is that the probability that two node pairs are assigned the same similarity by CN is high. That is to say, the CN index is less distinguishable, especially in relatively sparse networks. For example, in the case $N = 1000$, $p =$ and $k = 10$, there are about $5 \times 10^5$ node pairs, 94.01% of which are assigned zero score, and among all non-zero scores, 79.87% are 1. As shown later, real cases may be even worse: for instance, in a router-level Internet with 5022 nodes, 99.59% of node pairs are assigned zero score by CN, while among the non-zero scores, 91.11% are equal to one. The additional information involving the next-nearest neighbors introduced by the LP index makes the similarities much more distinguishable and thus remarkably enhances the accuracy. Note that if the maximal number of paths of length three connecting two arbitrary nodes is $P_{\max}$, any $\epsilon$ in the interval $(0, 1/P_{\max})$ will give exactly the same predictions. Therefore, the prediction accuracy of the LP index is not sensitive to the parameter $\epsilon$ when $\epsilon$ is not too large. Indeed, by setting $\epsilon$ to a small positive number like 0.01, one can obtain a near-optimal accuracy, usually less than 1% below the real optimum (see also Table II, where we compare the optimal AUC values with the values obtained by setting $\epsilon = 0.01$ for the six real networks). To find the optimal value of $\beta$, one can first calculate the largest eigenvalue of the adjacency matrix; the optimal $\beta$ is always smaller than its reciprocal. It is then easy to approach the optimal $\beta$. For example, the optimal values of $\beta$ for the relevant data points shown in Fig. 1 and Fig. 2 are all no larger than 0.03.

FIG. 5: (Color online) A log-log plot of how the computational time (in milliseconds) depends on the node degree for three indices, CN (circles), LP (triangles) and Katz (squares). The network size, $N = 1000$, and the strength of randomness, $p = 0.2$, are fixed. Each data point is obtained by averaging over 10 independent realizations. The hardware environment is the same as that stated in the caption of Fig. 4.

Next, we discuss the computational complexities of the three similarity indices. In calculating the CN index, for each node, denoted by $x$, we first search all of $x$'s neighbors (called step 1), and then lay out the neighbors of each of $x$'s neighbors (called step 2). If a node $y$ appears $n$ times in step 2, $s_{xy} = n$. Since the time complexity of traversing the neighborhood of a node is simply $k$, the time complexity of calculating the CN index is $O(Nk^2)$. Analogously, for the LP index, what we need to do is go one step further (called step 3) to check all neighbors of each of $x$'s second-order neighbors. If a node $y$ appears $n$ times in $x$'s second-order neighborhood and $m$ times in $x$'s third-order neighborhood, $s_{xy} = n + \epsilon m$. Therefore, the time complexity of calculating the LP index is $O(Nk^3)$. A detailed illustration for an example network consisting of four nodes is shown in Fig. 3. For the Katz index, the time complexity is mainly determined by the matrix inversion, which is $O(N^3)$. In Fig. 4 and Fig. 5, we report the numerical results on the computational complexity of the three similarity indices, which are well in accordance with the analysis. Besides the time complexity, memory space is another limitation on algorithmic implementation for huge-size networks. In calculating the CN and LP indices, the memory required is of the order $O(Nk)$, while for the Katz index, it is of the order $O(N^2)$. In a word, compared with the widely applied CN index and Katz index, the LP index is not only highly effective (i.e., accurate), but also highly efficient (i.e., requiring relatively little memory and CPU time).


IV. EMPIRICAL ANALYSIS 

In this paper, we consider six representative networks drawn from disparate fields: (i) PPI: a protein-protein interaction network containing 2617 proteins and 11855 interactions [51]. Although this network is not connected (it contains 92 components), most nodes belong to the giant component, whose size is 2375. (ii) NS: a network of coauthorships between scientists who are themselves publishing on the topic of networks [52]. This network contains 1589 scientists, 128 of whom are isolated. Here we do not consider those isolated nodes. NS is poorly connected: it consists of 268 connected components, and the size of the largest connected component is only 379. (iii) Grid: an electrical power grid of the western US, with nodes representing generators, transformers and substations, and links corresponding to the high-voltage transmission lines between them. This network contains 4941 nodes and is well connected. (iv) PB: a network of US political blogs. The original links are directed; here we treat them as undirected. PB has 1224 nodes, and the giant component contains 1222 nodes. (v) INT: the router-level topology of the Internet, collected by the Rocketfuel Project. INT has 5022 nodes and is well connected, while it is an extremely sparse network with an average degree of only 2.49. (vi) USAir: the network of the US air transportation system, which contains 332 airports and 2126 airlines.

TABLE I: The basic topological features of the giant components of the six example networks. $N$ and $M$ are the total numbers of nodes and links, respectively. $\langle k \rangle$ is the average degree of the network, and $\langle d \rangle$ is the average shortest distance between node pairs. $C$ and $r$ are the clustering coefficient and the assortative coefficient, respectively. Nodes with degree 1 are excluded from the calculation of the clustering coefficient. $H$ is the degree heterogeneity, defined as $H = \langle k^2 \rangle / \langle k \rangle^2$, where $\langle k \rangle$ denotes the average degree.

Networks      N      M     <k>    <d>      C       r      H
PPI        2375  11693   9.847   4.59  0.388   0.454  3.476
NS          379    941   4.823   4.93  0.798  -0.082  1.663
Grid       4941   6594   2.669  15.87  0.107   0.003  1.450
PB         1222  16717  27.360   2.51  0.360  -0.221  2.970
INT        5022   6258   2.492   5.99  0.033  -0.138  5.503
USAir       332   2126  12.807   2.46  0.749  -0.208  3.464

Note that all the similarity indices considered here, as well as the well-known indices reported in the literature (except the preferential attachment index), will give zero score to a pair of nodes located in two disconnected components. Therefore, here we consider only the giant component, and when preparing the probe set, we also make sure that the remaining training set represents a connected network. Specifically, each time before moving a link to the probe set, we first check whether this removal would make the training network disconnected. Table I summarizes the basic topological features of the giant components of those networks. Brief definitions of the monitored topological measures can be found in the table caption; more details can be found in review articles on complex networks.
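The probe-set construction that keeps the training network connected can be sketched as follows. The names `split_links` and `still_connected_without` are hypothetical, and the strategy of visiting links in a random order and skipping those whose removal would disconnect the network is an assumption; the text only states that connectivity is checked before each removal.

```python
import random
from collections import deque

def still_connected_without(neighbors, u, v):
    """True if v is still reachable from u when the link (u, v) is ignored."""
    seen = {u}
    queue = deque([u])
    while queue:
        a = queue.popleft()
        for b in neighbors[a]:
            if a == u and b == v:
                continue              # skip the link being tested
            if b == v:
                return True
            if b not in seen:
                seen.add(b)
                queue.append(b)
    return False

def split_links(links, probe_fraction=0.1, seed=0):
    """Split links into (training, probe) without disconnecting the network.

    links is a list of (u, v) tuples of a connected, simple network.
    """
    rng = random.Random(seed)
    neighbors = {}
    for u, v in links:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)
    target = int(round(probe_fraction * len(links)))
    order = list(links)
    rng.shuffle(order)
    probe = []
    for u, v in order:
        if len(probe) >= target:
            break
        # Move the link only if its endpoints stay connected afterwards.
        if still_connected_without(neighbors, u, v):
            neighbors[u].discard(v)
            neighbors[v].discard(u)
            probe.append((u, v))
    probe_set = set(probe)
    training = [e for e in links if e not in probe_set]
    return training, probe
```

In a network with many bridge links, such as a tree, this sketch may return fewer probe links than requested, since no bridge can be removed without breaking connectivity.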

We apply the link prediction algorithm to the six real networks, and the accuracies are shown in Table II, with the entries corresponding to the highest accuracies emphasized in bold. Clearly, the LP index always performs better than the CN index; in particular, for INT, the AUC is sharply improved from 0.653 to 0.943. Except for Grid, the LP index gives predictions competitively accurate with those of the Katz index. Grid is a strongly localized network in which most links are of short geographical length, and thus the average topological distance of Grid, $\langle d \rangle = 15.87$, is much larger than those of the other five example networks. Although Grid is geographically localized, its clustering coefficient is relatively small and it lacks short loops, since such loops are redundant and inefficient from the engineering viewpoint. Actually, in Grid, when a link is removed, it is usually hard to find a very short path (e.g., of length 2 or 3) connecting the two endpoints. Therefore, the CN and LP indices, which consider only very short paths, fail to recover the correlation between two directly connected nodes once the link between them is removed. In addition, we note that the optimal value of $\epsilon$ for USAir is negative. In USAir, the large-degree nodes are densely connected and share many common neighbors. Even without the contribution of $\epsilon A^3$, the links among large-degree nodes are assigned very high scores, so the additional term, $\epsilon A^3$, changes their relative positions little. Consider two small local airports, $x$ and $y$, which are connected to their local central airports, $x'$ and $y'$. Of course, many hubs are common neighbors of $x'$ and $y'$, and $x'$ and $y'$ may be directly connected. If the link $(x, x')$ is removed, the similarities between $x$ and all other nodes are zero. Otherwise, the similarities $s_{xy'}$ (by $x$-$x'$-hub-$y'$), $s_{xy}$ (by $x$-$x'$-$y'$-$y$), and $s_{xh}$, where $h$ represents a hub node (by $x$-$x'$-hub-$h$ or $x$-$x'$-$y'$-$h$), are positive due to the contributions of paths of length 3. There are many links connecting small local airports and local centers, some of which are removed while the others are kept in the training set. According to the above discussion, the removed links have lower scores than the nonexistent links because of the additional term $\epsilon A^3$. In a word, the very specific structure of USAir (a hierarchical organization consisting of hubs, local centers and small local airports) makes the LP index with positive $\epsilon$ worse than the simple CN, which corresponds to $\epsilon = 0$; this is also the reason why a negative $\epsilon$ performs even better.

TABLE II: Accuracies of the three similarity indices, measured by the area under the ROC curve (AUC). Each number is obtained by averaging over 10 independent realizations. The entries corresponding to the highest accuracies are emphasized in bold. For the LP and Katz indices, the AUC values correspond to the optimal parameter. LP* denotes the LP index with a fixed parameter $\epsilon = 0.01$. The very small difference between the optimal case and the case with $\epsilon = 0.01$ suggests that in real applications, one can directly set $\epsilon$ to a very small number, instead of searching for its optimum, which may cost much time.

For USAir, the optimal value of $\epsilon$ is negative. See the explanation in the text.

For USAir, we set $\epsilon = -0.01$.

Table 3 presents the computation time of the link pre- 
diction algorithm on the three similarity indices. Clearly, 
CN costs the least. Note that, the computational com- 
plexity in calculating the LP index is very sensitive to 
the average degree, while the one in calculating the Katz 
index is very sensitive to the network size. Therefore, 
the algorithm using LP index has great superiority for 
the huge-size and sparse networks compared with the 
one adopting the Katz index. Take INT as an exam- 
ple, the algorithm using the Katz index runs about one 
day while the one using the LP index takes less than half 
minute. Since the real challenge on computational com- 
plexity is always relevant to the huge-size real networks, 
which are mostly very sparse [l[, the LP index is much 
more practical than the Katz index. As a final remark, 
one may concern that whether to employ higher-order 
paths is worthwhile in practice, like to define a similarity 



7 



TABLE III: Computation time (in microsecond) of the link 
prediction algorithm on the three similarity indices of the six 
example networks. The hardware environment is the same as 
what we stated in the caption of Figure 4. 



Nets 


PPI 


NS 


Grid 


PB 


INT 


USAir 


CN 


10690 


253 


5161 


31112 


6711 


2208 


LP 


543589 


1638 


11344 


2873403 


27641 


93892 


Katz 


8073316 


27479 


69961063 


1051528 


72550935 


17603 



TABLE IV: Comparison of the accuracies of the original local
path index (n = 3, see Eq. (3)) and the higher-order local
path index (n = 4, see Eq. (5)), measured by the area under
the ROC curve (AUC). Each number is obtained by averaging
over 10 independent realizations. The AUC values reported
here correspond to the optimal parameter. The average
shortest distance $\langle d \rangle$ and the improvement (%)
obtained by considering higher-order paths are also listed,
and the six real networks are ordered by their average
shortest distances.

Nets                  USAir     PB        PPI      NS       INT      Grid
$\langle d \rangle$   2.46      2.51      4.59     4.93     5.99     15.87
n = 3                 0.960     0.941     0.970    0.988    0.943    0.697
n = 4                 0.959     0.937     0.973    0.989    0.959    0.759
Improvement           -0.104    -0.425    0.309    0.101    1.70     8.90



index in the form 

$S = A^2 + \epsilon A^3 + \epsilon^2 A^4.$ (5)

We give a brief discussion on this issue in Appendix A. 
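
The cost contrast between the LP and Katz indices discussed above can be illustrated with a minimal Python sketch. It is not the implementation used for the timings in Table III; the function names lp_index and katz_index, the toy network and the parameter values are assumptions made for illustration. The LP index of Eq. (3), $S = A^2 + \epsilon A^3$, needs only sparse matrix products, whereas the Katz index in its standard closed form, $S = (I - \beta A)^{-1} - I$, requires a dense matrix inversion.

```python
import numpy as np
import scipy.sparse as sp


def lp_index(A, eps):
    """Local path index S = A^2 + eps * A^3, using sparse products only."""
    A = sp.csr_matrix(A, dtype=float)
    A2 = A @ A
    return A2 + eps * (A2 @ A)


def katz_index(A, beta):
    """Katz index S = (I - beta*A)^{-1} - I, via a dense matrix inversion.

    Assumes beta is below the reciprocal of the largest eigenvalue of A.
    """
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    return np.linalg.inv(np.eye(n) - beta * A) - np.eye(n)


# Toy undirected path network 0-1-2-3 (illustrative only).
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]])
S_lp = lp_index(A, eps=0.01).toarray()
S_katz = katz_index(A, beta=0.01)
```

On this path network, nodes 0 and 3 share no common neighbor, so only the single three-step path contributes and S_lp[0, 3] equals eps.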

V. CONCLUSION AND DISCUSSION 

In this paper, we introduced a local path index to estimate
the likelihood of the existence of a link between two
nodes. We proposed a network model with controllable
density and noise strength in generating links. In this
model, the LP index provides slightly more accurate
predictions than the Katz index, especially in highly noisy
cases. We further used six representative real networks to
test the three similarity indices, showing that the LP
index provides predictions competitive in accuracy with
those of the Katz index. Compared with the Katz index, the
LP index requires much less CPU time and memory space, and
is therefore more practical. Ignoring the degree-degree
correlation, the time complexities of calculating the LP
index and the Katz index are $O(N\langle k \rangle^3)$ and
$O(N^3)$, respectively. Hence for huge (i.e., very large
$N$) and sparse (i.e., very small average degree
$\langle k \rangle$) networks, the advantage of the LP index
is striking.
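
To make the scale of this advantage concrete, consider a purely hypothetical network with $N = 10^6$ nodes and average degree $\langle k \rangle = 10$ (these values are assumptions chosen for illustration, not data from this paper). Neglecting constant prefactors, the ratio of the two complexities is

```latex
\frac{N^{3}}{N \langle k \rangle^{3}}
  = \frac{N^{2}}{\langle k \rangle^{3}}
  = \frac{10^{12}}{10^{3}}
  = 10^{9}.
```

The ratio grows quadratically with $N$ at fixed $\langle k \rangle$, which is why the gap widens for larger and sparser networks.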

Highly accurate predictions are significant in practice.
For example, in many biological networks, such as
protein-protein interaction networks, metabolic networks
and food webs, the discovery of links/interactions is
costly in the laboratory or the field. Instead of blindly
checking all possible interactions, predicting in advance
based on the interactions already known and focusing on
those links most likely to exist can sharply reduce the
experimental costs if the predictions are accurate enough
[30, 31]. For others, like the friendship networks in
web society, very likely but not yet existent links can be
suggested to the relevant users as recommendations of
promising friendships. These recommendations can help
users find new friends and thus enhance their loyalty
to the web sites. Besides the practical significance, it is
worthwhile to emphasize that the study of link prediction
can also provide some theoretical insights about the
structural organization. For example, in this paper, the
unexpected results on Grid and USAir give evidence of
some specific structural properties that are not
immediately apparent. Another example is that the
preferential attachment index usually gives poor
predictions, and when it works relatively well, it implies
that the testing
network has strong rich-club phenomenon [4l|, [3]. Al- 
though the focus of this paper is not to investigate the 
relations between suitable similarity indices and network 
structures, we believe it is an interesting issue worth fur- 
ther studies. 

In this paper, we only considered the link prediction
problem in static networks. However, many real networks
are evolving all the time, and the links created at
different times should in principle be assigned different
weights. This time-dependent link prediction problem has
rarely been investigated and certainly deserves serious
study in the future [58]. Most previous studies in this
direction only test the algorithmic accuracy on real
networks. Here we argue that modeled networks should be
used, because one can control some meaningful parameters
in a model that cannot be directly observed in real
networks (e.g., the strength of noise or irrationality).
We hope the proposed model could become a prototype for
testing the accuracy of link prediction algorithms.
However, it is currently too simple, and bringing it
closer to real networks, for example by introducing
controllable degree heterogeneity and degree-degree
correlation, would be very helpful.

This paper concerns only simple networks; however, the
local path index can be easily extended to more
complicated cases. For example, we can handle directed
networks by replacing the original adjacency matrix, $A$,
by an asymmetric one, weighted networks by replacing $A$
by a weighted matrix, and networks with self-connections
by assigning nonzero diagonal elements. Actually, Murata
and Moriyasu [59] have already investigated the link
prediction problem in weighted networks; however, the
credibility of their work has recently been challenged by
the empirical evidence that weak ties may play a more
important role in link prediction than strong ties [60].
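
These extensions amount to changing only the matrix that enters Eq. (3). The following minimal sketch illustrates this under stated assumptions: the function name lp_index, the convention that entry (i, j) denotes a link from i to j, and the link weights are all hypothetical choices for illustration.

```python
import numpy as np


def lp_index(M, eps):
    """Local path index S = M^2 + eps * M^3 for any square matrix M."""
    M = np.asarray(M, dtype=float)
    M2 = M @ M
    return M2 + eps * (M2 @ M)


# Directed network: entry (i, j) = 1 is a link from i to j (asymmetric matrix).
A_directed = np.array([[0, 1, 0],
                       [0, 0, 1],
                       [0, 0, 0]])
S_directed = lp_index(A_directed, eps=0.01)

# Weighted network: entries hold hypothetical link weights instead of 0/1.
W = np.array([[0.0, 2.0, 0.0],
              [2.0, 0.0, 0.5],
              [0.0, 0.5, 0.0]])
S_weighted = lp_index(W, eps=0.01)
```

In the directed case the resulting scores are no longer symmetric: node 2 is reachable from node 0 in two steps, but not the reverse. Self-connections would be handled in the same way by setting nonzero diagonal entries.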

APPENDIX A: SIMILARITY INDEX INVOLVING
HIGHER-ORDER PATHS 

A straightforward method to extend the local path index
is to consider higher-order paths. Such a similarity
index is of the form

$S = A^2 + \epsilon A^3 + \epsilon^2 A^4 + \cdots + \epsilon^{n-2} A^n,$ (A1)

where $n > 2$ is the maximal order. As shown in Fig. 3,
the computational complexity in an uncorrelated network
is $O(N\langle k \rangle^n)$, which grows fast with
increasing $n$ and will exceed the complexity of
calculating the Katz index for large $n$. We therefore
concentrate on the case of $n = 4$, equivalent to the one
shown in Eq. (5).
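
The sum in Eq. (A1) can be accumulated one matrix power at a time. The sketch below is an illustration only: the function name higher_order_lp_index and the toy network are assumptions, and dense matrices are used for brevity although a sparse representation would be needed for large networks.

```python
import numpy as np


def higher_order_lp_index(A, eps, n):
    """S = A^2 + eps*A^3 + ... + eps^(n-2) * A^n for maximal order n."""
    if n < 2:
        raise ValueError("maximal order n must be at least 2")
    A = np.asarray(A, dtype=float)
    power = A @ A              # current power A^m, starting at m = 2
    S = power.copy()
    for m in range(3, n + 1):
        power = power @ A      # advance to A^m
        S += eps ** (m - 2) * power
    return S


# Toy undirected path network 0-1-2-3-4 (illustrative only).
A = np.zeros((5, 5))
for i in range(4):
    A[i, i + 1] = A[i + 1, i] = 1.0

S3 = higher_order_lp_index(A, eps=0.01, n=3)  # original LP index, Eq. (3)
S4 = higher_order_lp_index(A, eps=0.01, n=4)  # higher-order index, Eq. (5)
```

Nodes 0 and 4 are four steps apart, so their score is zero for n = 3 and becomes nonzero only for n = 4, which is the kind of long-range pair that the extra term can rank.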

As shown in Table IV, the improvements in accuracy are
small except for the power grid. Sometimes, introducing
higher-order relations even decreases the accuracy, as
for USAir and PB. The results are very sensitive to the
average shortest distances of the networks. If
$\langle d \rangle$ is very short, considering paths of
length up to three seems enough, and the additional term,
$\epsilon^2 A^4$, has little effect (e.g., PPI, NS and
INT) or even a negative effect (e.g., USAir and PB). Only
when the network has a long average shortest distance may
considering higher-order relations be cost-effective.
Since most real networks exhibit a strong small-world
effect [1, 2, 3, 4, 5], a local path index taking into
account paths of length no more than three may be
practically sufficient.
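
The quantities reported in Table IV can be sketched as follows. This is a hedged illustration rather than the evaluation code behind the table: the function names are hypothetical, the AUC is computed exactly over all score pairs in its standard form (the probability that a missing link scores higher than a nonexistent one, with ties counted as one half), and the relative-change formula for the improvement is an assumption, although it does match the tabulated values to their printed precision.

```python
import numpy as np


def auc(missing_scores, nonexistent_scores):
    """Exact AUC over all (missing, nonexistent) score pairs; ties count 0.5."""
    m = np.asarray(missing_scores, dtype=float)[:, None]
    x = np.asarray(nonexistent_scores, dtype=float)[None, :]
    return float(np.mean((m > x) + 0.5 * (m == x)))


def improvement_percent(auc_n3, auc_n4):
    """Relative AUC change (%) from n = 3 to n = 4 (assumed formula)."""
    return 100.0 * (auc_n4 - auc_n3) / auc_n3


# Hypothetical similarity scores, for illustration only.
toy_auc = auc([0.9, 0.4], [0.4, 0.1, 0.0])

# AUC values for Grid and USAir taken from Table IV.
grid_gain = improvement_percent(0.697, 0.759)
usair_gain = improvement_percent(0.960, 0.959)
```

With the Table IV entries, the assumed formula gives a gain of about 8.90% for Grid and a loss of about 0.104% for USAir.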



[1] R. Albert and A.-L. Barabasi, Rev. Mod. Phys. 74, 47
(2002).

[2] S. N. Dorogovtsev and J. F. F. Mendes, Adv. Phys. 51, 
1079 (2002). 

[3] M. E. J. Newman, SIAM Rev. 45, 167 (2003). 

[4] S. Boccaletti, V. Latora, Y. Moreno, M. Chavez and
D.-U. Hwang, Phys. Rep. 424, 175 (2006).

[5] L. da F. Costa, F. A. Rodrigues, G. Travieso and P. R.
U. Boas, Adv. Phys. 56, 167 (2007).

[6] G. Salton and M. J. McGill, Introduction to Modern
Information Retrieval (McGraw-Hill, Auckland, 1983).

[7] G. Salton, Automatic text processing: the
transformation, analysis, and retrieval of information by
computer (Addison-Wesley Longman Publishing Co., Inc.,
Boston, MA, USA, 1989).

[8] C. D. Manning, P. Raghavan and H. Schütze,
Introduction to Information Retrieval (Cambridge University
Press, 2008).

[9] L. Getoor and C. P. Diehl, Link mining: A survey, in 
Proceeding of the ACM SIGKDD International Confer- 
ence on Knowledge Discovery and Data Mining (ACM 
Press, New York, 2005). 

[10] R. R. Sarukkai, Computer Networks 33, 377 (2000). 

[11] A. Popescul and L. Ungar, Statistical relational learning 
for link prediction, in Workshop on Learning Statistical 
Models from Relational Data (ACM Press, New York, 
2003). 

[12] J. Zhu, J. Hong and J.-G. Hughes, Using Markov chains
for link prediction in adaptive web sites, in Proceedings of
the thirteenth ACM conference on Hypertext and hypermedia
(ACM Press, New York, 2002).

[13] M. Bilgic, G. M. Namata and L. Getoor, Combining
collective classification and link prediction, in Workshop
on Mining Graphs and Complex Structures at the IEEE
International Conference on Data Mining (2007).

[14] K. Yu, W. Chu, S. Yu, V. Tresp and Z. Xu, Stochastic
Relational Models for Discriminative Link Prediction, in
Advances in Neural Information Processing Systems 19
(MIT Press, Cambridge, MA, 2007).

[15] P. Holme and M. Huss, J. R. Soc. Interface 2, 327 (2005).

[16] B. Gallagher, H. Tong, T. Eliassi-Rad, and C. Faloutsos,
Using ghost edges for classification in sparsely labeled
networks, in Proceeding of the ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining
(ACM Press, New York, 2008).

[17] D. Lin, An information-theoretic definition of similarity,
in Proceedings of the International Conference on Machine
Learning, Madison, August (1998).

[18] D. Liben-Nowell and J. Kleinberg, J. Am. Soc. Inf. Sci.
Technol. 58, 1019 (2007).

[19] F. Lorrain and H. C. White, J. Math. Sociol. 1, 49 (1971). 

[20] P. Jaccard, Bulletin de la Societe Vaudoise des Science 
Naturelles 37, 547 (1901). 

[21] L. A. Adamic and E. Adar, Social Networks 25, 211 
(2003). 

[22] A.-L. Barabasi and R. Albert, Science 286, 509 (1999). 

[23] L. Katz, Psychometrika 18, 39 (1953).

[24] F. Gobel and A. Jagers, Stochastic Processes and Their 
Applications 2, 311 (1974). 

[25] F. Fouss, A. Pirotte, J.-M. Renders and M. Saerens, 
IEEE Trans. Knowl. Data Eng. 19, 355 (2007). 

[26] S. Brin and L. Page, Computer Networks and ISDN Sys- 
tems 30, 107 (1998). 

[27] G. Jeh and J. Widom, SimRank: A Measure of
Structural-Context Similarity, in Proceedings of the ACM
SIGKDD International Conference on Knowledge Discovery
and Data Mining (ACM Press, New York, 2002).

[28] V. D. Blondel, A. Gajardo, M. Heymans, P. Senellart, P.
V. Dooren, SIAM Rev. 46, 647 (2004).

[29] E. A. Leicht, P. Holme and M. E. J. Newman, Phys. Rev. 
E 73, 026120 (2006). 

[30] A. Clauset, C. Moore and M. E. J. Newman, Nature 453, 
98 (2008). 

[31] S. Redner, Nature 453, 47 (2008). 






[32] G. Adomavicius and A. Tuzhilin, IEEE Trans. Knowl.
Data Eng. 17, 734 (2005).

[33] T. Zhou, Personal Recommendation in User-Object
Networks, in J. Zhou (ed.), Complex Sciences (Springer,
Germany, 2009).

[34] T. Zhou, J. Ren, M. Medo, and Y.-C. Zhang, Phys. Rev.
E 76, 046115 (2007).

[35] T. Zhou, L.-L. Jiang, R.-Q. Su, Y.-C. Zhang, Europhys.
Lett. 81, 58004 (2008).

[36] Y.-C. Zhang, M. Medo, J. Ren, T. Zhou, T. Li, and F.
Yang, Europhys. Lett. 80, 68003 (2007).

[37] Y.-C. Zhang, M. Blattner, and Y.-K. Yu, Phys. Rev. Lett.
99, 154301 (2007).

[38] L. Getoor, ACM SIGKDD Explorations Newsletter 5, 84
(2003).

[39] J. O'Madadhain, J. Hutchins and P. Smyth, ACM
SIGKDD Explorations Newsletter 7, 23 (2005).

[40] M. Rattigan and D. Jensen, ACM SIGKDD Explorations
Newsletter 7, 41 (2005).

[41] Z. Huang, X. Li, H. Chen, Link prediction approach
to collaborative filtering, in Proceedings of the 5th
ACM/IEEE-CS Joint Conference on Digital Libraries
(ACM Press, New York, 2005).

[42] M. E. J. Newman, Phys. Rev. E 64, 025102 (2001).

[43] T. Sørensen, Biol. Skr. 5, 1 (1948).

[44] T. Zhou, L. Lü and Y.-C. Zhang, Eur. Phys. J. B (to be
published), arXiv: 0901.0553.

[45] D. Sun, T. Zhou, J.-G. Liu, R.-R. Liu, C.-X. Jia, and
B.-H. Wang, Phys. Rev. E 80, 017101 (2009).

[46] D. R. White and K. P. Reitz, Social Networks 5, 193
(1983).

[47] J. A. Hanley and B. J. McNeil, Radiology 143, 29 (1982).

[48] S. Geisser, Predictive inference: An introduction
(Chapman and Hall, New York, 1993).

[49] J. L. Herlocker, J. A. Konstan, K. Terveen, and J. T.
Riedl, ACM Trans. Inf. Syst. 22, 5 (2004).

[50] G. H. Golub, C. F. Van Loan, Matrix Computations (JHU
Press, 1996).

[51] C. von Mering, R. Krause, B. Snel, M. Cornell, S. G.
Oliver, S. Fields and P. Bork, Nature 417, 399 (2002).

[52] M. E. J. Newman, Phys. Rev. E 74, 036104 (2006).

[53] D. J. Watts and S. H. Strogatz, Nature 393, 440 (1998).

[54] R. Ackland, Mapping the US political blogosphere:
Are conservative bloggers more prominent, Presentation
to BlogTalk Downunder, Sydney, 2005, available at
http://incsub.org/blogtalk/images/robertackland.pdf.

[55] N. Spring, R. Mahajan, D. Wetherall and T. Anderson,
IEEE/ACM Trans. Networking 12, 2 (2004).

[56] V. Batagelj and A. Mrvar, Pajek Datasets, available at
http://vlado.fmf.uni-lj.si/pub/networks/data/default.htm.

[57] M. E. J. Newman, Phys. Rev. Lett. 89, 208701 (2002).

[58] J. Liu and D.-S. Deng, Physica A 388, 3643 (2009).

[59] T. Murata and S. Moriyasu, Link prediction of social
networks based on weighted proximity measures, in Proc.
IEEE/WIC/ACM International Conf. Web Intelligence
(ACM Press, 2007).

[60] L. Lü and T. Zhou, Role of Weak Ties in Link
Prediction of Complex Networks, in Proc. 18th ACM Conf. Inf.
Knowl. Management (CIKM'2009, ACM Press, 2009),
arXiv: 0907.1728.



